Bioinformatics Data Skills

Utah Valley University - BIOL490R (Special Topics)


Computer requirements for this class: CLICK HERE

Command Line Projects and the Unix Philosophy

Week 1

Ideology of ‘Robust and Reproducible’ Bioinformatics

Topics:

  • What are “data skills?” | Reproducibility and open science | How to learn bioinformatics | Documentation | The importance of caution

Assignments:

  • Read through BDS Chapter 1… twice, and carefully
  • Find and explore the supplemental materials for the chapter on GitHub
  • Go through the resources below (Do this every week before class!)
  • Assignment 1 - Reflection piece on why you want to learn command line skills and best practices
  • Set up your computer environment (Command-line, Git)

Resources

Practice

  • Make sure you’ve watched the videos above and can navigate in your command line terminal.

  • Do you know what the following commands do?

      pwd
    
      cd ~
    
      cd ..
    
      ls -a
    
      ls -l

For your consideration:

  • “Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.” –Brian Kernighan
  • “Since the computer is a sharp enough tool to be really useful, you can cut yourself on it.” – John Tukey



Week 2

Proper Project Organization

Topics:

  • One directory per project | data as ‘read-only’ | rules for naming things | project structure | documentation
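Here is a minimal sketch of what "one directory per project" and "data as read-only" can look like in practice. The project name `my_project`, the subdirectory names, and the placeholder data file are all assumptions made for illustration; this is one common layout, not a required one.

```shell
# Work in a throwaway location so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir"

# One directory per project, with separate places for data, code, and results.
project="my_project"   # hypothetical project name
mkdir -p "$project/data" "$project/scripts" "$project/analysis"

# Document the project from the start.
echo "Project: $project (created $(date +%F))" > "$project/README.md"

# Placeholder standing in for a raw data file.
printf '>seq1\nACGT\n' > "$project/data/raw_example.fasta"

# Treat raw data as read-only: remove write permission for everyone.
chmod a-w "$project/data/raw_example.fasta"

# The permission string for the data file should no longer contain a "w".
ls -l "$project/data"
```

Everything above uses relative paths from inside the working directory, which is worth noticing while you read it.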

Assignments:

  • Read through BDS Chapter 2 at least once
  • Work through BDS Chapter 2, following along in your own terminal
  • Assignment 2 - Create organized project template using code

Resources

Practice

  • Re-create your project directory template by copy-pasting each line of code from your assignment to make sure it gives the same result
  • Spend time making sure that you intuitively understand relative filepaths and get comfy with the terminal
  • Spend 2-3 hours mucking about in your terminal reworking the lines from Chapter 2 over and over until it feels normal

For your consideration:

  • “If you are learning to play the piano, and you settle for a couple hours a week of instruction without practicing on your own, you’re gonna be a really crappy piano player, like me.” –Geoff Zahn


Unix refresher and sequence data types

Week 3

The Unix Shell

Topics:

  • The Unix philosophy | text streams | pipes and redirection | process control | process substitution

Assignments:

  • Read through BDS Chapter 3
  • Work through BDS Chapter 3, following along in your own terminal
  • Assignment 3 - Running shell scripts, redirecting, pipes, background processes
  • Read/watch ALL of the resources below. Be able to write a for-loop.
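As a warm-up, here is a small sketch that combines a for-loop with redirection and a pipe. The file names are made up, and this is only one of many ways to write it.

```shell
# Work in a throwaway location so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Make three small example files (hypothetical names).
for name in sampleA sampleB sampleC; do
    printf 'line1\nline2\n' > "$name.txt"    # > redirects stdout into a file
done

# Loop over the files, count the lines in each, and build a report.
for file in sample*.txt; do
    n=$(wc -l < "$file" | tr -d ' ')         # < feeds the file to stdin
    echo "$file $n" >> line_counts.txt       # >> appends instead of overwriting
done

# A pipe passes one program's output to the next program's input.
sort -r line_counts.txt | head -n 1
```

Try changing the loop so it runs over a different set of files, and predict the contents of `line_counts.txt` before you look.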

Resources

Practice

For your consideration:

  • “This is the Unix philosophy: Write programs that do one thing and do it well. Write programs to work together. Write programs to handle text streams, because that is a universal interface.” –Doug McIlroy



Week 4

Working with Sequence Data

Topics:

  • fasta and fastq file formats | using existing tools to work with sequence data
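To make the two formats concrete, here is a sketch that builds a toy fastq file, counts its records, and converts it to fasta. The reads are made up, and the sketch assumes the common 4-lines-per-record fastq layout (no line-wrapped sequences).

```shell
# Work in a throwaway location so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir"

# A toy FASTQ file with two made-up reads. Each record is 4 lines:
# @header, sequence, +, quality string.
printf '@read1\nACGTAT\n+\nIIIIII\n@read2\nGGCCTA\n+\n@IIIII\n' > toy.fastq

# Count records: total lines divided by 4. Counting lines that start with "@"
# would overcount here, because a quality string can also begin with "@".
n_lines=$(wc -l < toy.fastq)
n_reads=$((n_lines / 4))
echo "reads: $n_reads"

# Convert to FASTA: keep header and sequence lines, swapping "@" for ">".
# The result goes to a new file; the original is left untouched.
awk 'NR % 4 == 1 { print ">" substr($0, 2) } NR % 4 == 2 { print }' toy.fastq > toy.fasta
cat toy.fasta
```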

Assignments:

  • Read through BDS Chapter 10 at least once
  • Don’t work through the examples yet (we can return to them once we have more skills)
  • Assignment 4 - converting between formats, inspecting and trimming reads, using pre-made command-line tools

Resources

Practice

  • How many sequences are stored (in total) in the fastq files associated with Assignment_4?

  • How many sequences end with the sequence “AT” in each fastq file?

  • Which fastq file associated with Assignment_4 contains the following sequence:

      CCTTCATGCTGTCCTGCAATTACGATAGCATTTCTTTGACGACGAC

For your consideration:

  • “Treat data as read-only.” –Vince Buffalo
  • Never directly edit any fasta or fastq file! If you have to make edits, redirect them to a new version of the raw file.
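A minimal sketch of that habit, using a made-up fasta file: the edit happens in a stream, and the redirect writes it to a new file while the raw file stays as it was. The file names and the particular header cleanup are assumptions for illustration.

```shell
# Work in a throwaway location so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Placeholder raw file (made-up sequences).
printf '>seq1 sample=A\nACGT\n>seq2 sample=B\nTTGA\n' > raw.fasta
chmod a-w raw.fasta    # guard against accidental edits

# Simplify the headers by dropping everything after the first space.
# sed writes to stdout, and the redirect sends that to a NEW file.
sed '/^>/ s/ .*//' raw.fasta > cleaned.fasta

# The raw file is unchanged; the cleaned copy has the edits.
head -n 1 raw.fasta cleaned.fasta
```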


Using Existing Tools in the Command Line

Week 5

Combining Unix Skills and Command-Line Software

Topics:

  • Interfacing with command-line tools | redirecting stdout and stderr | customizing parameters
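Many command-line tools write results to stdout and progress messages to stderr. The sketch below uses a stand-in function, `fake_tool` (hypothetical, not a real program), to show how the two streams can be sent to different places.

```shell
# Work in a throwaway location so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir"

# A stand-in for a command-line tool (hypothetical): results go to stdout,
# progress messages go to stderr.
fake_tool() {
    echo "result for $1"
    echo "processing $1 ..." >&2
}

# Send stdout and stderr to separate files.
fake_tool sample1 > results.txt 2> run.log

# Send both streams to the same file.
fake_tool sample2 > combined.txt 2>&1

cat results.txt
cat run.log
```

Keeping the log separate from the results means the results file can go straight into the next tool.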

Assignments:

Resources


More Powerful Unix Tools

Week 6

Unix Data Tools

Topics:

Assignments:

Resources

  • Introduction to regular expressions video
  • sed video playlist (definitely worth your time!)

Practice



Week 7

Unix Data Tools, Continued

Topics:

  • More handy shell programs: cut, paste, sort, uniq, tr, rename, tee, xargs, awk
  • Manipulating text data from one format to another
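Here is a short sketch of a few of these programs working together on a made-up, semicolon-delimited taxonomy table. The file name and its contents are assumptions for illustration.

```shell
# Work in a throwaway location so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Toy semicolon-delimited taxonomy table (made-up rows).
printf 'Fungi;Ascomycota;Eurotiales\nFungi;Basidiomycota;Agaricales\nFungi;Ascomycota;Hypocreales\n' > taxa.txt

# Pull out field 2, sort so duplicates sit next to each other, then count them.
cut -d ";" -f 2 taxa.txt | sort | uniq -c > phylum_counts.txt
cat phylum_counts.txt

# Convert the delimiter from ";" to a tab with tr, writing a new file.
tr ';' '\t' < taxa.txt > taxa.tsv
cat taxa.tsv
```

Note that `uniq` only collapses adjacent duplicates, which is why `sort` comes first.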

Assignments:

  • Continue working through BDS Chapter 7
  • Assignment 6 - convert between tabular and fasta formatted data | process/command substitution | advanced paste
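The sketch below shows command substitution and process substitution side by side, using paste to turn a made-up two-column table into fasta. Process substitution is a bash feature, so the commands are saved to a small script (hypothetical name `tsv_to_fasta.sh`) and run with bash explicitly. Treat it as one possible approach, not the expected assignment answer.

```shell
# Work in a throwaway location so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Toy two-column table: sequence ID, then sequence (made-up data).
printf 'seq1\tACGT\nseq2\tTTGA\n' > seqs.tsv

cat > tsv_to_fasta.sh <<'EOF'
#!/usr/bin/env bash
# Command substitution $(...) captures a command's output as text.
n=$(wc -l < seqs.tsv)
echo "converting $n records" >&2

# Process substitution <(...) lets paste read each command's output as if
# it were a file. A newline delimiter interleaves the two streams.
paste -d '\n' <(cut -f 1 seqs.tsv | sed 's/^/>/') <(cut -f 2 seqs.tsv)
EOF

bash tsv_to_fasta.sh > seqs.fasta
cat seqs.fasta
```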

Resources

  • “Process substitution” vs “command substitution” video
  • Using paste to build fasta from tsv video

Practice

  • Here’s an awful-looking one-line command that prints out the phylum from each line of Chapter_7_Practice_File_2.txt along with a number sequence next to it showing which line of the file it came from.

  • It uses both process and command substitution but, essentially, it’s just the paste command joining the phylum (first output column) with the numbers 1-34 (second output column)

  • I want you to break it apart, looking at each component and understand why it works!

      paste <(cat Chapter_7_Practice_File_2.txt | cut -d ";" -f 2) <(seq $(wc -l Chapter_7_Practice_File_2.txt | cut -d " " -f 1))
  • If you wanted to use process substitution again to extend this whole command in order to add a header to the output, what would you do? (i.e., add a first row that is “PHYLUM LINE_NUMBER”)


Finding and Retrieving Data

Week 8

Online Repositories and Approaches to Downloading

Topics:

  • NCBI / SRA
  • Searches, filters, metadata
  • Database files and formats
  • Documenting data acquisition
  • Checksums
  • File compression
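A minimal sketch of checksums, compression, and documenting acquisition, using a placeholder file in place of a real download. It assumes `sha256sum` is available (GNU coreutils); on macOS, `shasum -a 256` plays the same role. The README entry is a hypothetical example of what to record.

```shell
# Work in a throwaway location so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Placeholder standing in for a downloaded file.
printf '>seq1\nACGT\n' > download.fasta

# Record a checksum right after downloading.
sha256sum download.fasta > checksums.sha256

# Later, verify that the file has not changed.
sha256sum -c checksums.sha256

# Compress the file while keeping the original (-c writes to stdout).
gzip -c download.fasta > download.fasta.gz

# Read the compressed file without unpacking it on disk.
gzip -dc download.fasta.gz | head -n 1

# Note where and when the data came from (hypothetical README entry).
echo "$(date +%F): download.fasta obtained from <source URL goes here>" >> README.md
```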

Assignments:

  • Work through BDS Chapter 6

  • Case Study 2 - Reproducibly downloading stuff (BDS p. 120)

    • Full documentation
    • Checksums
    • Markdown README

Resources

Practice


Working with Supercomputers

Week 9

Interfacing with Remote Machines

Topics:

  • tmux, ssh, public keys
  • navigating the HPC
  • good HPC citizenship
  • SLURM scripts and commands
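For orientation, here is a sketch of what a small SLURM batch script can look like. All values are placeholders, the file name `hello.slurm` is hypothetical, and most clusters also require site-specific options (such as `--account` and `--partition`) that are left out here.

```shell
# Work in a throwaway location so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Write a small batch script. The #SBATCH lines are read by the scheduler
# and are ordinary comments as far as bash is concerned.
cat > hello.slurm <<'EOF'
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --time=00:05:00
#SBATCH --ntasks=1
#SBATCH --mem=1G
#SBATCH --output=hello_%j.out

echo "Job started on $(hostname) at $(date)"
EOF

# On the cluster, submit and monitor with:
#   sbatch hello.slurm
#   squeue -u "$USER"
# Off the cluster, the script still runs as plain bash for a quick check.
bash hello.slurm
```

Asking for only the time and memory a job actually needs is part of good HPC citizenship.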

Assignments:

  • Work through BDS Chapter 4 before class this week
  • Assignment 7 - build and submit 3 separate jobs on the HPC

Resources

Practice



Week 10

Interfacing with Remote Machines, Continued

Topics:

  • Installing other software not found in “modules”
  • File transfers
  • Customizing your remote workspace

Assignments:

Resources:

  • sra-toolkit is available as a module on the CHPC, but you’ll need to configure it before first use by running

    vdb-config -i
  • prefetch instructions

  • fastq-dump instructions from the Edwards Lab

  • FileZilla is a free FTP client that really comes in handy for moving files to and from remote servers

Practice

  • See if you can get itsxpress to run


Shell scripts

Week 11

TBD

Topics:

  • Let’s use this time to explore the Kudzu GBS data
  • We can also talk about bioinformatics collaborations and your role as a data expert

Assignments:

  • TBD

Resources

Practice

  • TBD



Week 12

Bioinformatics Shell Scripting

Topics:

  • Turning a workflow into a script
  • Bash script parameters ($1 $2 $3 …)
  • if, then, else, fi
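Here is a small sketch of positional parameters and an if/then/else/fi block in one script. The script name `check_file.sh` and the example files are made up for illustration.

```shell
# Work in a throwaway location so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Save a small script (hypothetical name) that takes a file name as $1.
cat > check_file.sh <<'EOF'
#!/bin/bash
# $0 is the script name, $1 the first argument, $# the number of arguments.
if [ "$#" -lt 1 ]; then
    echo "usage: $0 <file>" >&2
    exit 1
fi

if [ -f "$1" ]; then
    echo "$1 exists and has $(wc -l < "$1") lines"
else
    echo "$1 not found"
fi
EOF

printf 'a\nb\nc\n' > example.txt
bash check_file.sh example.txt
bash check_file.sh missing.txt
```

Running the script with no arguments prints the usage message and exits with a non-zero status, which is a helpful habit for any script that expects input.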

Assignments:

  • Work through BDS Chapter 12

  • Remember that “create a new project” script you wrote at the beginning of the semester?

    • Turn it into an interactive script where the user provides the name of the project
    • It should then generate a full project directory structure based on that name

Resources

Practice

  • Build a bash script that can:

    • determine the file extension of fasta, fasta.gz, fastq, fastq.gz
    • use conditional statements to print the number of sequences in the file, regardless of format (as long as it’s one of those 4)
    • this forum exchange might help


Putting it all together

Week 13

Composing Full Pipelines

Topics:

  • The duct tape of bioinformatics

  • Good pipelines need:

    • Documentation
    • Version control
    • Validation
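As a rough sketch of how documentation and validation can show up inside a pipeline script, the example below wraps one stand-in step (extracting sequence IDs) with input checks, a run log, and an output check. The script name, file names, and the step itself are assumptions for illustration.

```shell
# Work in a throwaway location so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir"

printf '>seq1\nACGT\n>seq2\nTTGA\n' > input.fasta   # made-up input

cat > pipeline.sh <<'EOF'
#!/bin/bash
# Stop on errors, on unset variables, and on failures inside pipes.
set -euo pipefail

input="$1"
output="$2"

# Validation: refuse to run on a missing or empty input.
if [ ! -s "$input" ]; then
    echo "error: $input is missing or empty" >&2
    exit 1
fi

# Documentation: log when this ran, on what, and with which bash version.
echo "$(date +%F) input=$input bash=$BASH_VERSION" >> pipeline.log

# The actual step (a stand-in): extract sequence IDs.
grep '^>' "$input" | sed 's/^>//' > "$output"

# Validation: the output should have one ID per input record.
n_in=$(grep -c '^>' "$input")
n_out=$(wc -l < "$output")
if [ "$n_in" -ne "$n_out" ]; then
    echo "error: expected $n_in IDs, got $n_out" >&2
    exit 1
fi
echo "ok: $n_out IDs written to $output"
EOF

bash pipeline.sh input.fasta ids.txt
```

For the version control piece, a script like this would normally be committed to a Git repository so that every change to the pipeline is tracked.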

Assignments:

  • Continue working through BDS Chapter 12

Resources



Week 14

Running a Pipeline on a Remote Machine

Topics:

Assignments:

  • Case Study 3 - Assemble a metagenome on the remote cluster

    • metaSPAdes
    • classify reads with DIAMOND?

Resources

Practice



Week 15

Creating a Custom Bioinformatics Tool

Topics:

  • Testing with toy examples
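A toy example is an input small enough that you already know the right answer by eye. The sketch below tests a hypothetical function, `count_fasta`, against a three-record file; the function and file names are made up for illustration.

```shell
# Work in a throwaway location so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir"

# The function under test (hypothetical): count records in a FASTA file.
count_fasta() {
    grep -c '^>' "$1"
}

# A toy input small enough to check by eye: exactly 3 records.
printf '>a\nAC\n>b\nGT\n>c\nTT\n' > toy.fasta

expected=3
observed=$(count_fasta toy.fasta)

# Compare the observed answer with the known answer.
if [ "$observed" -eq "$expected" ]; then
    echo "PASS: counted $observed records"
else
    echo "FAIL: expected $expected, got $observed" >&2
    exit 1
fi
```

The same pattern scales up: keep a few tiny inputs with known answers next to your tool and rerun the checks whenever the code changes.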

Assignments:

  • Case Study 4 - Download NCBI marker genes and use Unix tools to build a custom RDP-Classifier-compatible reference database

    • Reengineer https://github.com/gzahn/Format_NCBI_QIIME
    • Edirect (command-line version of NCBI search tool)
    • ftp, BLAST, NCBI, data cleaning and reformatting
    • Turn into a completely reproducible and portable script
    • requires entrez_qiime.py installation and use
    • has to be well-documented
    • push tool to GitHub
    • uses chapters: 2,3,6,7,10,12,5
    • script should automate download and building with helpful messages along the way



Week 16

Where to go from here?

Topics:

Assignments:

  • Assignment 10 - Reflection piece on what you’ve learned and what next steps you’ll take
